<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
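	<title>MinHash token filter | ElasticSearch 7.7 Definitive Guide, Chinese edition</title>
	<meta name="keywords" content="ElasticSearch Definitive Guide Chinese edition, elasticsearch 7, es7, real-time data analysis, real-time data retrieval" />
    <meta name="description" content="ElasticSearch Definitive Guide Chinese edition, elasticsearch 7, es7, real-time data analysis, real-time data retrieval" />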
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
	<script>
	var _link = 'analysis-minhash-tokenfilter.html';
    </script>
</head>
<body>
<div class="main-container">
    <section id="content">
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">Original English version: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-minhash-tokenfilter.html" rel="nofollow" target="_blank">https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-minhash-tokenfilter.html</a>. The original documentation is copyright www.elastic.co.<br/>Local English version: <a href="../en/analysis-minhash-tokenfilter.html" rel="nofollow" target="_blank">../en/analysis-minhash-tokenfilter.html</a></div>
                        <!-- start body -->
                  <div class="page_header">
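<strong>IMPORTANT</strong>: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" rel="nofollow">current release documentation</a>.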
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch Guide [7.7]</a></span>
»
<span class="breadcrumb-link"><a href="analysis.html">Text analysis</a></span>
»
<span class="breadcrumb-link"><a href="analysis-tokenfilters.html">Token filter reference</a></span>
»
<span class="breadcrumb-node">MinHash token filter</span>
</div>
<div class="navheader">
<span class="prev">
<a href="analysis-lowercase-tokenfilter.html">« Lowercase token filter</a>
</span>
<span class="next">
<a href="analysis-multiplexer-tokenfilter.html">Multiplexer token filter »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="analysis-minhash-tokenfilter"></a>MinHash token filter<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc">edit</a>
</h2>
</div></div></div>

<p>Uses the <a href="https://en.wikipedia.org/wiki/MinHash" class="ulink" target="_top">MinHash</a> technique to produce a
signature for a token stream. You can use MinHash signatures to estimate the
similarity of documents. See <a class="xref" href="analysis-minhash-tokenfilter.html#analysis-minhash-tokenfilter-similarity-search" title="Using the min_hash token filter for similarity search">Using the <code class="literal">min_hash</code> token filter for similarity search</a>.</p>
<p>The <code class="literal">min_hash</code> filter performs the following operations on a token stream in
order:</p>
<div class="olist orderedlist">
<ol class="orderedlist">
<li class="listitem">
Hashes each token in the stream.
</li>
<li class="listitem">
Assigns the hashes to buckets, keeping only the smallest hashes of each
bucket.
</li>
<li class="listitem">
Outputs the smallest hash from each bucket as a token stream.
</li>
</ol>
</div>
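<p>The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MD5 stands in for a 128-bit hash, and assigning a hash to a bucket by modulo is an assumed scheme. Lucene's <code class="literal">MinHashFilter</code> uses its own hash function and bucket assignment, and the function name <code class="literal">min_hash_signature</code> is hypothetical.</p>

```python
import hashlib


def min_hash_signature(tokens, bucket_count=512):
    """Return the smallest hash of each non-empty bucket (illustrative only)."""
    buckets = {}
    for token in set(tokens):
        # Step 1: hash the token. MD5 is a stand-in 128-bit hash, not Lucene's.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        # Step 2: assign the hash to a bucket (modulo is an assumed scheme)
        # and keep only the smallest hash seen so far for that bucket.
        bucket = value % bucket_count
        buckets[bucket] = min(value, buckets.get(bucket, value))
    # Step 3: output the smallest hash from each non-empty bucket.
    return [buckets[b] for b in sorted(buckets)]
```

<p>Because only set membership and minima matter, the order of the tokens and any repeated tokens do not change the result of this sketch.</p>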
<p>This filter uses Lucene’s
<a href="https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/minhash/MinHashFilter.html" class="ulink" target="_top">MinHashFilter</a>.</p>
<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="analysis-minhash-tokenfilter-configure-parms"></a>Configurable parameters<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc">edit</a>
</h3>
</div></div></div>
<div class="variablelist">
<dl class="variablelist">
<dt>
<span class="term">
<code class="literal">bucket_count</code>
</span>
</dt>
<dd>
(Optional, integer)
Number of buckets to which hashes are assigned. Defaults to <code class="literal">512</code>.
</dd>
<dt>
<span class="term">
<code class="literal">hash_count</code>
</span>
</dt>
<dd>
(Optional, integer)
Number of ways to hash each token in the stream. Defaults to <code class="literal">1</code>.
</dd>
<dt>
<span class="term">
<code class="literal">hash_set_size</code>
</span>
</dt>
<dd>
<p>
(Optional, integer)
Number of hashes to keep from each bucket. Defaults to <code class="literal">1</code>.
</p>
<p>Hashes are retained by ascending size, starting with the bucket’s smallest hash
first.</p>
</dd>
<dt>
<span class="term">
<code class="literal">with_rotation</code>
</span>
</dt>
<dd>
(Optional, boolean)
If <code class="literal">true</code>, the filter fills empty buckets with the value of the first non-empty
bucket to its circular right if the <code class="literal">hash_set_size</code> is <code class="literal">1</code>. If the
<code class="literal">bucket_count</code> argument is greater than <code class="literal">1</code>, this parameter defaults to <code class="literal">true</code>.
Otherwise, this parameter defaults to <code class="literal">false</code>.
</dd>
</dl>
</div>
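<p>The rotation behavior described for <code class="literal">with_rotation</code> can be pictured with a small Python sketch. It assumes that buckets are held in a list, that an empty bucket is represented by <code class="literal">None</code>, and that "circular right" means the next higher index, wrapping around to the start. The helper name <code class="literal">fill_empty_buckets</code> is hypothetical and this is not Lucene's implementation.</p>

```python
def fill_empty_buckets(buckets):
    """Fill each empty slot with the first non-empty value to its circular right."""
    n = len(buckets)
    if all(value is None for value in buckets):
        # Nothing to copy from, so the buckets stay empty.
        return list(buckets)
    filled = []
    for i in range(n):
        offset = 0
        # Walk to the right, wrapping around, until a non-empty bucket is found.
        while buckets[(i + offset) % n] is None:
            offset += 1
        filled.append(buckets[(i + offset) % n])
    return filled
```

<p>For example, under these assumptions <code class="literal">fill_empty_buckets([None, 7, None, None, 3, None])</code> returns <code class="literal">[7, 7, 3, 3, 3, 7]</code>: the last slot wraps around to take the value of the second bucket.</p>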
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="analysis-minhash-tokenfilter-configuration-tips"></a>Tips for configuring the <code class="literal">min_hash</code> filter<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc">edit</a>
</h3>
</div></div></div>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
<code class="literal">min_hash</code> filter input tokens should typically be k-words shingles produced
from <a class="xref" href="analysis-shingle-tokenfilter.html" title="Shingle token filter">shingle token filter</a>. You should
choose <code class="literal">k</code> large enough so that the probability of any given shingle
occurring in a document is low. At the same time, as
internally each shingle is hashed into to 128-bit hash, you should choose
<code class="literal">k</code> small enough so that all possible
different k-words shingles can be hashed to 128-bit hash with
minimal collision.
</li>
<li class="listitem">
<p>We recommend you test different arguments for the <code class="literal">hash_count</code>, <code class="literal">bucket_count</code> and
<code class="literal">hash_set_size</code> parameters:</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
To improve precision, increase the <code class="literal">bucket_count</code> or
<code class="literal">hash_set_size</code> arguments. Higher <code class="literal">bucket_count</code> and <code class="literal">hash_set_size</code> values
increase the likelihood that different tokens are indexed to different
buckets.
</li>
<li class="listitem">
To improve recall, increase the value of the <code class="literal">hash_count</code> argument. For
example, setting <code class="literal">hash_count</code> to <code class="literal">2</code> hashes each token in two different ways,
increasing the number of potential candidates for search.
</li>
</ul>
</div>
</li>
<li class="listitem">
By default, the <code class="literal">min_hash</code> filter produces 512 tokens for each document. Each
token is 16 bytes in size. This means the filter adds around 8 KB
(512 × 16 bytes) to the size of each document.
</li>
<li class="listitem">
The <code class="literal">min_hash</code> filter is used for Jaccard similarity. This means
that it doesn’t matter how many times a document contains a certain token,
only that if it contains it or not.
</li>
</ul>
</div>
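<p>Two of the tips above lend themselves to a quick check in Python: the default storage overhead and the set-based nature of Jaccard similarity. The names below are hypothetical, and returning <code class="literal">1.0</code> for two empty inputs is simply a convention chosen for this sketch.</p>

```python
DEFAULT_BUCKET_COUNT = 512  # default number of tokens produced per document
HASH_SIZE_BYTES = 16        # each token is a 128-bit hash

# 512 tokens of 16 bytes each: 8192 bytes, or about 8 KB per document.
DEFAULT_OVERHEAD_BYTES = DEFAULT_BUCKET_COUNT * HASH_SIZE_BYTES


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: shared distinct tokens over all distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    union = a.union(b)
    if not union:
        return 1.0  # convention chosen for this sketch when both inputs are empty
    return len(a.intersection(b)) / len(union)
```

<p>Because the inputs are converted to sets, <code class="literal">jaccard(["a", "a", "b"], ["a", "b", "b", "b"])</code> is <code class="literal">1.0</code>: repeated tokens carry no extra weight.</p>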
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="analysis-minhash-tokenfilter-similarity-search"></a>Using the <code class="literal">min_hash</code> token filter for similarity search<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc">edit</a>
</h3>
</div></div></div>
<p>The <code class="literal">min_hash</code> token filter allows you to hash documents for similarity search.
Similarity search, or nearest neighbor search is a complex problem.
A naive solution requires an exhaustive pairwise comparison between a query
document and every document in an index. This is a prohibitive operation
if the index is large. A number of approximate nearest neighbor search
solutions have been developed to make similarity search more practical and
computationally feasible. One of these solutions involves hashing of documents.</p>
<p>Documents are hashed in such a way that similar documents are more likely
to produce the same hash code and land in the same hash bucket,
while dissimilar documents are more likely to be hashed into
different hash buckets. This type of hashing is known as
locality sensitive hashing (LSH).</p>
<p>Depending on what constitutes the similarity between documents,
various LSH functions <a href="https://arxiv.org/abs/1408.2927" class="ulink" target="_top">have been proposed</a>.
For <a href="https://en.wikipedia.org/wiki/Jaccard_index" class="ulink" target="_top">Jaccard similarity</a>, a popular
LSH function is <a href="https://en.wikipedia.org/wiki/MinHash" class="ulink" target="_top">MinHash</a>.
Conceptually, MinHash produces a signature for a document by applying a
random permutation to the whole index vocabulary (a random
numbering of the vocabulary) and recording the minimum value under this permutation
for the document (the smallest number assigned to any vocabulary word that is present
in the document). This is repeated with several different permutations;
the minimum values from all of them together constitute the
signature for the document.</p>
<p>In practice, instead of random permutations, a number of hash functions
are chosen. A hash function calculates a hash code for each of a
document’s tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.</p>
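<p>A minimal Python sketch of this idea follows. It assumes a family of hash functions simulated by salting BLAKE2b with a seed, which is an illustrative choice rather than what Lucene does, and the names <code class="literal">signature</code> and <code class="literal">estimate_jaccard</code> are hypothetical. The fraction of signature positions on which two documents agree serves as an estimate of their Jaccard similarity.</p>

```python
import hashlib


def signature(tokens, num_hashes=64):
    """One minimum hash code per seeded hash function (tokens must be non-empty)."""
    unique = set(tokens)
    sig = []
    for seed in range(num_hashes):
        # Salting BLAKE2b with the seed simulates a different hash function.
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            hashlib.blake2b(t.encode("utf-8"), digest_size=16, salt=salt).digest()
            for t in unique
        ))
    return sig


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

<p>In this sketch, more hash functions give a less noisy estimate at the cost of a longer signature, which mirrors the trade-off the filter's parameters control.</p>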
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="analysis-minhash-tokenfilter-customize"></a>Customize and add to an analyzer<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc">edit</a>
</h3>
</div></div></div>
<p>To customize the <code class="literal">min_hash</code> filter, duplicate it to create the basis for a new
custom token filter. You can modify the filter using its configurable
parameters.</p>
<p>For example, the following <a class="xref" href="indices-create-index.html" title="Create index API">create index API</a> request
uses two custom token filters to configure a new
<a class="xref" href="analysis-custom-analyzer.html" title="Create a custom analyzer">custom analyzer</a>:</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
<code class="literal">my_shingle_filter</code>, a custom <a class="xref" href="analysis-shingle-tokenfilter.html" title="Shingle token filter"><code class="literal">shingle</code>
filter</a>. <code class="literal">my_shingle_filter</code> only outputs five-word shingles.
</li>
<li class="listitem">
<code class="literal">my_minhash_filter</code>, a custom <code class="literal">min_hash</code> filter. <code class="literal">my_minhash_filter</code> hashes
each five-word shingle once. It then assigns the hashes into 512 buckets,
keeping only the smallest hash from each bucket.
</li>
</ul>
</div>
<p>The request also assigns the custom analyzer to the <code class="literal">fingerprint</code> field mapping.</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": {      <a id="CO416-1"></a><i class="conum" data-value="1"></i>
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,          <a id="CO416-2"></a><i class="conum" data-value="2"></i>
          "bucket_count": 512,      <a id="CO416-3"></a><i class="conum" data-value="3"></i>
          "hash_set_size": 1,       <a id="CO416-4"></a><i class="conum" data-value="4"></i>
          "with_rotation": true     <a id="CO416-5"></a><i class="conum" data-value="5"></i>
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_shingle_filter",
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fingerprint": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}</pre>
</div>
<div class="console_widget" data-snippet="snippets/959.console"></div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO416-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>Configures a custom shingle filter to output only five-word shingles.</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO416-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p>Each five-word shingle in the stream is hashed once.</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO416-3"><i class="conum" data-value="3"></i></a></p>
</td>
<td align="left" valign="top">
<p>The hashes are assigned to 512 buckets.</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO416-4"><i class="conum" data-value="4"></i></a></p>
</td>
<td align="left" valign="top">
<p>Only the smallest hash in each bucket is retained.</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO416-5"><i class="conum" data-value="5"></i></a></p>
</td>
<td align="left" valign="top">
<p>The filter fills empty buckets with the values of neighboring buckets.</p>
</td>
</tr>
</table>
</div>
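<p>To get a feel for what the shingle stage in the request above passes on to <code class="literal">min_hash</code>, the following Python sketch builds five-word shingles from a string. Splitting on whitespace is an assumed stand-in for the <code class="literal">standard</code> tokenizer, and <code class="literal">word_shingles</code> is a hypothetical helper, not part of Elasticsearch.</p>

```python
def word_shingles(text, size=5):
    """Return every run of `size` consecutive words, joined by single spaces."""
    words = text.split()  # whitespace split: a rough stand-in for a real tokenizer
    return [
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    ]
```

<p>With this sketch, <code class="literal">word_shingles("a b c d e f")</code> yields <code class="literal">["a b c d e", "b c d e f"]</code>, while a text with fewer than five words yields no shingles at all.</p>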
</div>

</div>
<div class="navfooter">
<span class="prev">
<a href="analysis-lowercase-tokenfilter.html">« Lowercase token filter</a>
</span>
<span class="next">
<a href="analysis-multiplexer-tokenfilter.html">Multiplexer token filter »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>